Extremely large scale simulation of a Kardar-Parisi-Zhang model using graphics cards 
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The octahedron model introduced recently has been implemented onto graphics cards, which 
permits extremely large scale simulations via binary lattice gases and bit coded algorithms. We 
confirm scaling behavior belonging to the two-dimensional Kardar-Parisi-Zhang universality class 
I and find a surface growth exponent: f3 — 0.24f5(f5) on 2 17 x 2 17 systems, ruling out /3 = f/4 

suggested by field theory. The maximum speedup with respect to a single CPU is 240. The 
steady state has been analyzed by finite size scaling and a growth exponent a = 0.393(4) is found. 
Correction-to-scaling exponents are computed and the power-spectrum density of the steady state 
is determined. We calculate the universal scaling functions, cumulants and show that the limit 
distribution can be obtained by the sizes considered. We provide numerical fitting for the small and 
large tail behavior of the steady state scaling function of the interface width. 

'"q . PACS numbers: 05.70.Ln, 05.70.Np, 82.20.Wt 
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I. INTRODUCTION 



The research of the nonlinear stochastic differential equation and universality class introduced by Kardar, Parisi 
and Zhang (KPZ) [l[ is in the forefront of interest nowadays again @, Q. This is largely due to the progress in exact 



solutions for various one-dimensional realizations and initial conditions (see for ex amp le |4|-[8j]). This equation can 
describe the dynamics of simple growth processes in the thermodynamic limit [tl llCf . randomly stirred fluid 
directed polymers in random media [l2j dissipative transport fl3L [l4| , and the magnetic flux lines in superconductors 
(l5j . Due to the mapping onto the Asymmetric Exclusion Process (ASEP) [l6[ in one dimension it is also a fundamental 
model of non-equilibrium particle system pjj ■ The KPZ equation specifies the evolution of the height function h(x, t) 
in the d dimensional space 



dthhc, t) = v + cr 2 V ' 2 /i(x, t) + A(V/i(x, t)Y + rjhc, t) . (1) 

CO ■ 

Here v and A are the amplitudes of the mean and local growth velocity, 02 is a smoothing surface tension coefH- 
■ cient and 77 roughens the surface by a zero-average, Gaussian noise field exhibiting the variance (rj(x, t)ry(x',t')) = 
I 2D5 d (x — x')(t — t'). The letter D denotes the noise amplitude and () means the distribution average. The equation 
• is solvable in (1 + 1) d due to the Galilean symmetry 1 , [Til ] and an incidental fluctuation-dissipation symmetry [l2j], 
while in higher dimensions approximations are available only. The model exhibits diverging correlation length, thus 
scale a invariance, that can be understood by the steady current in the ASEP model corresponding to the up-down 
anisotropy of KPZ. Therefore KPZ equation has been investigated by renormalization techniques 18-20]. As the 
^—1 , result of the competition of roughening and smoothing terms, models described by the KPZ equation exhibit a rough- 
ening phase transition between a weak-coupling regime (A < A c ) and a strong coupling phase. The strong coupling 
. 5^ I fixed point is inaccessible by perturbative renormalization group (RG) method. Therefore, the KPZ phase space has 
been the subject of controversies for a long time (2l| - [23| and the strong coupling fixed point has been located by 
^ ' non-perturbative RG very recently (24|. 

Discretized versions of KPZ have also been studied a lot ( l2llW27| . for a review see Q). Recently we have shown 
[28l |29| the mapping between the KPZ surface and the ASEP |30l. l3l| can straightforwardly be extended to higher 
dimensions. In 2+1 dimensions the mapping is just the simple extension of the rooftop model to the octahedron 
model as can be seen on Fig. 2 of [Hj]. The surface built up from the octahedra can be described by the edges meeting 
in the up/down middle vertexes. The up edges in the x or y directions are represented by <J x i y — +l-s, while the 
down ones by cr x / y — 1 in the model. This can also be understood as a special 2d cellular automaton [32J with the 



1 The invariance of Eq. JTJ under an infinitesimal tilting of the interface, resulting in a + z = 2 
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generalized Kawasaki updating rules 
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with probability p for attachment and probability q for detachment. We have confirmed that this mapping, using 
the parametrization: A = 2p/(p + q) — 1, reproduces the one-point functions of the continuum model. This kind 
of generalization of the ASEP model can be re gard ed as the simplest candidate for studying KPZ in d > 1: a one- 
dimensional model of self-reconstructing d-mers |33| diffusing in the d-dimensional space. Furthermore this lattice gas 
can be studied by very efficient simulation methods. 

Now we implement dynamic, bit-coded simulations of the conserved lattice gas models for graphics cards (GPUs), 
allowing very large system sizes L x L. The surface heights are reconstructed from the slopes 

i 3 

h ij = E^'^ + E "^'^ (3) 

1 = 1 k=l 

and the squared interface width 

w2 ( L ^ = j2 E - (i E M*)) a • ( 4 ) 

was calculated at certain sampling times (t). In the absence of any characteristic length, growth processes are expected 
to follow power-law behavior and the surface can be described by the Family- Vicsek [34j j scaling law: 

W(L,t)^L a f(t/L z ), (5) 

with the universal scaling function f(u) 

tt u )~{ const, if «»1 (6) 

Here a is the roughness exponent of the stationary regime, when the correlation length has exceeded the system size 
L; and ft is the growth exponent, describing the intermediate time behavior. The dynamical exponent z is just the 
ratio 

z = a/p. (7) 



II. BIT-CODED GPU SIMULATIONS 



The height of each surface site is thoroughly determined by two slopes, along the x and y axes respectively, whose 
absolute values are restricted to unity. Thus at each site two bits of information are required, hence a chunk of 4 x 4 
sites is encoded in one 32-bit word. 

Two slightly different layers of parallelization are used that reflect the two layered compute architecture provided 
by GPUs [351 ]: not communicating blocks at device level and communicating threads at block level. Both layers use 
domain decomposition with dead borders, i.e. conflicts at the subsystem borders are avoided simply by not updating 
them (see Fig.[IJ. A random translation is applied to the origin of the decomposition periodically to preserve statistics. 
To avoid having to deal with non 32-bit aligned memory these translations are restricted to multiples of four sites. 

The complete system is stored in global device memory, each block cell is copied into the block-local shared memory 
for precessing. Thus moving the origin of the device level decomposition results in cutting out the proper piece of the 
system taking the periodic boundary conditions into account. Moving the origin at each Monte Carlo step (MCS), 
i.e. by one overall update of the system, proved to be sufficient and the results are undistorted. 

The size of a thread cell is set to be 8 x 8 sites, the smallest possible choice, to maximize the number of threads 
per block. Due to this small subsystem size a new origin for the block level decomposition is picked at each update 
attempt, thus 64 times per MCS. Additionally the borders are not dead but delayed, i.e. when a thread picks the 
border of its cell to change it waits for the threads updating bulk sites to finish their updates. Corner sites are further 
delayed. 

If a thread hits a site belonging to a block border it does nothing, slightly reducing the actual number of update 
attempts per MCS. This is a minor effect, the ratio of block border to overall system size is approximately 1/128, and 
impacts the pre-factors of the scaling, but not exponents. 
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FIG. 1: Sketch of the dead border domain decomposition scheme employed. Lattice sites are indicated by dots, where the grey 
dots represent inactive sites. Since only two slopes relating any sites to its nearest neighbors are stored off-site only two edges 
are inactive. 



For random numb er g eneration each thread uses a 32-bit, linear congruential random number generator (LCRNG) 
with different seed [36[. Similar generators were previously successfully applied in GPU simulations of the Ising 
model [371 ] . Depending on the system size the generators have to be periodically reseeded, which potentially introduces 
the same correlations as using multiple generators in parallel. However, since no deviations from the earlier CPU 
results have been observed, we assume this to not disturb the statistics. Correlations resulting from parallel usage 
could only have local effects on the updates of a block cell. Moving the origin of the block level decomposition should 
effectively destroy such correlations. By the same argument reseeding the generators, using a Mersenne Twister 
generator |38[ running on the CPU, has no negative effect at all. Part of the results were double-checked, employing 
a skip-ahead 64-bit LCRNG [3!| with no need for reseeding. 

The simulations were performed on a C2070 GPU with 6GB graphics memory, which allowed for a maximum system 
size: 2 17 x 2 17 (4GB of memory required). Figure [2] shows benchmark results for the simulation. 
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FIG. 2: (Color online) Time in seconds for one MCS on a C2070 (black bullets, left axis). Speedup over a non parallel 
implementation on a Core i5 750@3GHz (red squares, right axis). The jump in speedup between log 2 L = 12 and 13 results 
from the system exceeding the CPUs L3 cache at this point. The maximum speedup factor achieved was roughly 240. 



The benchmarks only consider bare simulation times and exclude the time needed to transfer data between host 
and device. For large sizes, where the system exceeds the CPUs L3 cache, the performance drops significantly, this 
could be avoided in a CPU implementation using domain decomposition designed to optimize memory access of the 
CPU cache. 



4 



0.246 



0.244 



0.242 



0.24 



0.238 



0.236 









12 










13 

— 16 










17 


■1 




















- 1 f 




\ 

\ 





0.02 



0.04 
1/t 1 ' 2 



0.06 



0.08 



FIG. 3: (Color online) Local slopes of the surface growth for different sizes (L — 2 12 , 2 13 , 2 16 , 2 17 ). Averaging was done over 
20-100 independent runs. 



III. SURFACE GROWTH SCALING 



We have run 10 - 1000 independent simulations for sizes: L = 2 8 , 2 9 , 2 10 , 2 11 , 2 12 , 2 13 , 2 16 , 2 17 by starting from half 
filled (striped) lattice gases. This causes an intrinsic width of the initial zig-zag surface state, with W 2 (L, 0) = 1/4, 
a leading order constant correction to scaling, that we subtract at the beginning of the scaling analysis. The time 
between measurements increases exponentially 



U+i = (U + 10) • e m , with m > 0, t = 0, 



(8) 



where the program calculates and writes out the width of the surface. We used m = 0.01 to study the growth in 
larger systems, and m = 0.001 or m = 0.0001 to collect more data points in the steady state. By the scaling analysis 
we disregarded the initial time region: t < t m i c ~ 50, when basically an uncorrelated, random deposition process goes 
on. The growth is expected to follow simple scaling (|6]) asymptotically and we assume corrections in the form 



W(t,L -+oo) =^(1 + M 0O +M 01 -) 



(9) 



For finite system, when the correlation length exceeds L, the growth crosses over to a saturation, with the expected 
scaling behavior 



W(t ->■ oo, L) = aL a (l + clqL^ + ai L u 



(10) 



To see the corrections clearly we determined the effective exponent of /3, as the discretized, logarithmic derivative 
of® 



Mt) = 



In W(t, L oo) - In W(t', L -> oo) 
ln(f) - ln(t') 



(11) 



using t/t 1 — 2. As Fig.[3]shows the f3 e g(t) curves converge to the same asymptotic value for different sizes as 1/t — > 0, 
albeit for smaller systems the fluctuations are larger and the scaling breaks down as £(i) ~ L. One can read-off the 
most precise /3 — 0.2415(15) estimate from the largest system (L = where the simulations were followed up to 
t = 70.000 MCS. This agrees with our former estimates for this model [28, 29], but now the error margin is sufficiently 
small to exclude a convergence to the field theoretical value j3 = 1/4 [4(jj via the analytic corrections ((9|) . 

Following the subtraction of the constant leading-order term, corresponding to cf>o — —/3, bo = 1/2, the remaining 
corrections are seemingly small and the oscillations hinder to fit them out very precisely. We determined the next 
leading order correction exponent by fitting with the from (|12|) 

Mt) =/3 + Mi^ 1 , (12) 

in the time window t > t m i c and before the saturation region. From the largest system fit we obtained: (f>i = f.05 and 
b x = -0.f2. 
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FIG. 4: (Color online) Local slopes of the roughness exponent for different sizes. The line shows a fit with the form (|14p . Inset: 
Surface width of the stationary state as the function of linear system size. The line corresponds to a fit with the form (|10|) . 



On Fig. [3] we can observe that the local slopes do not change for late times, therefore assuming a W oc i - 2415 
asymptotic scaling we determined the probability distribution around this law for the largest size: L 17 for 1300 < t < 
70000 MCS. We calculated the distribution of y = W(t)/t - 2415 as shown on the left inset of Fig. [5] This opens up 
the possibility for a future comparison with a solution like in one dimension [4ll ] . 

Next we investigated the scaling in the steady state. This could be achieved in smaller systems with a higher 
data sampling resolution. We confirmed that the data corresponds to the steady state by visual inspection of the 
W 2 (t, L) as well as by analyzing the P(W 2 ) distribution. Similarly to the time dependence we determined the effective 
exponent of the roughness, which can be defined as the logarithmic derivative of the width 

In W(t -» oo, L) - In W(t -> oo, L/2) 
acS{L) = ln(L) = ln(L/2) " (13) 

Finite size scaling was done for systems of linear sizes in between L m in = 2 8 and I/ max = 2 13 and by considering the 
corrections using the fitting form 

a cS (L) = a + a^L^ 1 . (14) 

The local slopes of the steady state values a e s(l/L) are shown on Fig. |4] The fitting results in: a — 0.393(4), 
di = —1.24 and u>\ = 1.16. Using the a = 0.393(3) and (3 = 0.2415(15) estimates the dynamical scaling exponent 
is z — a/f3 — 1.627(26). With these values we get a + z = 2.02(3), which satisfies the scaling law expected by the 
Galilean symmetry. Fig. [5] shows a perfect data collapse with these exponents over several decades. 
We also investigated the power spectrum density (PSD) of the interface 

S(k,t) = (Mk,i)ft(-k,t)) , (15) 
where the height in the Fourier space is computed as 

MM = -Its £ *) - e M^x) ■ (16) 

x 

We computed h(k, t) from the surface profiles with the FFT method and determined S(k, t) by averaging over x and y 
directions. In the steady state the PSD is expected to scale as S ~ k~ 2 ~ 2a , which can be confirmed for 0.002 < k < 0.1 
(see inset of Fig. [5J. For larger k values we can see a growth of the S(k) function, which is the consequence of the 
lattice regularization. 



IV. PROBABILITY DISTRIBUTIONS 



The exact form of the spatially averaged height distribution P((h)) of the KPZ model in one dimension is a hot 
topic of statistical physics [343 an( l provides a definition of the KPZ universality class. The Pl(W 2 ) distribution in 
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FIG. 5: (Color online) Finite size scaling of W(L,t) for L = 2 12 , 2 13 , 2 16 , 2 17 . Right inset: PSD of the L = 2 13 system in the 



steady state. Left inset: distribution of W/t 



in the growth phase of the L — 2 system. 
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FIG. 6: (Color online) The universal scaling function 9l(x) in the steady state for L = 2 8 , 2 9 , 2 10 , 2 11 , 2 12 . The dotted line 
shows the L = 157 results of 1231. 



the steady state is also known exactly, in closed form for small and large x asymptotically [42[. In two dimensions 
not much is known about this distribution. 

In [H, [HI it was shown that the width distributions ^ L (x) = (W 2 )P L {W 2 / (W 2 )) of discrete KPZ surfaces 
exhibit universal behavior. Now we determined the probability distributions Pl(W 2 ) and calculated for 
L = 2 8 , 2 9 , 2 10 , 2 11 , 2 12 , 2 13 with the GPU code by measuring W 2 in the steady state. Averaging was done over 
N = 20 — 100 runs and 10 4 — 10 5 time steps. In case of the largest, 2 13 case the steady state averaging was done 
between 5 x 10 7 and 10 8 MCS. As Fig. [fJ]shows the data collapse is very good around x = 1, but deviations occur in 
the large and small x asymptotics due to the lack of sample points there. It is very difficult to collect enough statistics 
for the extremal cases as the width of Pl{W 2 ) grows as L 2a . 

By studying the finite size effects of extreme value statistics it was discovered [43[ that there is also universality in 
the first order (shape) correction to the limit distributions. It was also shown that if the finite size corrections of the 
cumulants can be neglected the shape corrections can be expressed via the limit distributions. To see this let us write 
Pl{W 2 ) in terms of the cumulant generating function 



Pl(W 2 



dq 

2^ 6XP 



-iq{W 2 - Kx) + 



n=2 



71! 



(17) 
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FIG. 7: (Color online) FSS of the cumulants: k x , k 2 , k 3 , k 4 (top to bottom) for L = 512,1024,2048,4096,8192. The lines 
correspond to power-law fitting with very small exponents. 



where k„-s are the n-th cumulants of W 2 (i. e. (W 2 ) = k± = vi), related to the n-th, non-central moments (v n ) as: 



K2 
K4 



2u 



3f| + I2v{v 2 - 6i/J 



Due to general scaling, the cumulants have the large L behavior 

OC L 

Let us assume that the corrections to scaling of the cumulants are of the form 

n n =L 2na (K° n + K lL-^ + ...) 



(18) 



(19) 



(20) 



To check this we determined the n = 1,2,3, 4 cumulants from W 2 data and performed a finite size scaling analysis. 
The corrections to scaling (|19[) were found to be negligible, as shown on Fig. [71 hence the universal limit distribution 
in principle can very well be approximated from the finite L results. 

Finally we tried to fit the small and large x asymptotics of ^{x) with similar functional forms that is known exactly 
in one dimension [42|. This assumption is based on the similarity of the underlying model, i.e. directed migration of 
dimers instead of particles. When we applied for the small x (x < 0.75) part of the ^l(x) the general form 



x A {B-x) exp(C/x D ) 



(21) 



we found stable nonlinear fittings with C ~ 2 and D ~ 2 in contrast to one dimension, where D = 1. This is similar 
to the small x extreme value statistics of the 1// Q noise, where one obtains exp(— a/x^) with /? depending on a. We 
fixed C = 2, D = 2 and tried to get a general form with integer coefficients up to the second order. We obtained 



aT a (10 - x) exp(-2/V)(l + x~ ia (9 - x) exp(-9.25/x^)) 

in good agreement with the L = 2048 data as shown on Fig. [8] 

For the lager x part we assumed again the form of the one dimensional model 

E exp(Fx)/x 



(22) 



(23) 



and obtained a nice agreement with: E = 2.915, F = —2.572. The least squares fit error was smaller that by a 
stretched exponential ansatz. 
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FIG. 8: (Color online) Small (left) and large (right) asymptotics of ^l(x). Dotted (blue) line corresponds to L — 157 of pTj ]. 
red line L — 2048 data, dashed line: fitting with the form ()22|) . Right: linear- logarithmic fitting (dashed line) to the large x 
part of the same data. 



V. CONCLUSIONS AND DISCUSSION 



In conclusion we have developed a bit-coded CUDA program, which simulates very efficiently a 2 + 1 dimensional 
discrete KPZ growth model (the octahedron model [28[) via binary lattice gases. Using this tool we could achieve 
extraordinary large sizes up to 2 17 x 2 17 with a maximum speedup 240 on NVIDIA Fermi Cards with respect to 
a single 3 GHz CPU core. This allows us to resolve debates over the scaling exponents by performing very precise 
scaling analysis of the interface width. 

Our growth exponent estimate f3 = 0.2415(15) is somewhat bigger than the results of [HJ (/3 = 0.221(2)), [45[ 
(P = 0.229(5)) and [H[ (/3 = 0.240(1)). It matches our former estimates for this model 0.245(5) [Hj], but excludes 
definitely the j3 = 0.25 field theoretical result. We also estimated the leading order correction to scaling exponent: 
01 = 1.05. 

The independent roughness exponent result a = 0.393(3) is in the middle of the range obtained by various numerical 
exponent estimates: i.e. between a = 0.36 0, [4^] a = 0.385(5) [Iti ] and the a = 0.4 field theoretical result, well out 
of error margin. This agrees with our former a = 0.395(5) [29[ and with the simulation results of [2l| (a = 0.393(3)). 
In the latter case even the correction to scaling exponent wi = 1.16 is the same. 

We analyzed the surface in the steady state by the power spectrum density method and confirmed the KPZ scaling 
for several decades above the lattice cut-off. We determined the universal scaling function and the cumulants of 
the surface width for different sizes and obtained the limit distribution via correction to scaling analysis. We gave 
analytical fitting for the small and large asymptotics. As compared to one dimension (42j, where a linear x dependence 
in the exponential is known exactly, we found x 2 dependence for small x. For the large x deviations the exjp(Fx)/x 
tail fits better than a stretched exponential functions as suggested in [481 ] . 

Our model and code proves to be a very efficient tool to study not only the (2 + 1) dimensional KPZ and ASEP 
models but, more generally it can be used in the research of fundamental nonequilibrium thermodynamical quantities 
like the large deviation function or entropy production [49l. l50j . It is also straightforward to extend it to study 
more complex system exhibiting pattern formation [51L |52|. the effect of quenched disorder [54| , the time-dependent 
structure factor [53[ or to higher dimensions [29[ . 

One may also ask if the results for this discrete model describe those of the continuum KPZ equation. In fact 
this is not an obvious question in d > 1 dimensions f55|. However, we think that the irrelevancy of anisotropy by 
renormalization group studies [561 ] excludes this in two dimensions and we find (2 + 1) dimensional KPZ universality 
class behavior. 

On completion of this study we discovered another way of accelerating our algorithm, which provides an additional 
factor of ~ 1.8 with respect to the simulations presented here. The technical details will published elsewhere \El\ . 
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